MLR Conceptual Questions

  1. State the necessary assumptions for a multiple linear regression model to be valid in terms of conducting hypothesis tests and providing prediction intervals.

    1. There is a linear relationship between the outcome variable and the independent variables.
    2. The residuals are normally distributed.
    3. The error terms are independent of one another.
    4. The predictors are not highly multicollinear.
    5. The variance of the error terms is similar across the values of the independent variables (homoscedasticity).

    Very generally, adherence to these assumptions allows a model to be evaluated in a multiple linear regression setting.
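A minimal sketch of how these assumptions might be checked in R, using the built-in `mtcars` data as a hypothetical stand-in (the same checks apply to the Auto data below):

```r
# Hypothetical example: fit a small model on base R's mtcars data.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Linearity and homoscedasticity: residuals vs fitted values should show
# no trend and roughly constant spread.
plot(fit$fitted.values, fit$residuals,
     xlab = "Fitted Values", ylab = "Residuals")

# Normality of residuals: QQ-plot plus a Shapiro-Wilk test.
qqnorm(fit$residuals); qqline(fit$residuals)
print(shapiro.test(fit$residuals))

# Multicollinearity: variance inflation factors (car::vif, used later in
# this document); values well above 10 are a common red flag.
```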
  2. True

  3. Feature selection is intended to choose the best subset of predictors: techniques that explain the data as well as possible, ideally with fewer predictors. Extra predictors add noise, increase cost, and can introduce collinearity. Common techniques include backward elimination, forward selection, and stepwise regression. I think stepwise regression should not be run without an analyst, because the procedure is nuanced: its steps are not linked to the goals of the study, it can easily miss the optimal model, and it does not directly address the question of interest. Stepwise selection tends to pick smaller models and can have the unintended consequence of dropping a predictor that is not statistically significant but is still useful to the question of interest. Feature selection methods on their own cannot produce an ideal model without an analyst who can take an objective look at the candidate model against the question of interest.
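As a sketch of what automated stepwise selection looks like, here is base R's `step()` run on the built-in `mtcars` data (a hypothetical stand-in; the predictor set is illustrative only):

```r
# Candidate predictors (hypothetical example on mtcars).
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Stepwise selection in both directions by AIC, starting from the null model.
both <- step(null, scope = formula(full), direction = "both", trace = 0)
print(formula(both))
```

The AIC-minimizing model `step()` returns still needs an analyst's review against the question of interest, which is exactly the point above.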

  4. If the final model includes third-, fourth-, and fifth-degree polynomial terms, it will be very complex. Adding more features to a model always increases training accuracy (low bias), but this model will have high variance because it is effectively memorizing the training data. I would expect the ASE to be higher on the test data, since the model was fit with more terms than necessary.
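A small illustration of this bias-variance trade-off, again on the built-in `mtcars` data as a hypothetical stand-in: training ASE keeps falling as the polynomial degree grows, while test ASE typically bottoms out and then rises.

```r
# Hypothetical example: split mtcars into a training and test set,
# then fit polynomials of increasing degree.
set.seed(1)
idx <- sample(nrow(mtcars), 20)
tr  <- mtcars[idx, ]   # training set
te  <- mtcars[-idx, ]  # test set

# Average squared error of a fitted model on a data set.
ase <- function(fit, dat) mean((dat$mpg - predict(fit, dat))^2)

for (d in 1:5) {
  fit <- lm(mpg ~ poly(hp, d), data = tr)
  cat(sprintf("degree %d: train ASE %5.2f, test ASE %5.2f\n",
              d, ase(fit, tr), ase(fit, te)))
}
```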

## Exercise 1 EDA

# Load the ISLR package for the Auto data set
library(ISLR)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
Auto$cylinders <- as.factor(Auto$cylinders)
Auto$origin <- as.factor(Auto$origin)
attach(Auto)

summary(Auto)
##       mpg        cylinders  displacement     horsepower        weight    
##  Min.   : 9.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   4:199     1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   5:  3     Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   6: 83     Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   8:103     3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                          
##   acceleration        year       origin                  name    
##  Min.   : 8.00   Min.   :70.00   1:245   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   2: 68   ford pinto        :  5  
##  Median :15.50   Median :76.00   3: 79   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98           amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00           amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00           chevrolet chevette:  4  
##                                          (Other)           :365
t(aggregate(mpg~cylinders,data = Auto,summary))
##             [,1]       [,2]       [,3]       [,4]       [,5]      
## cylinders   "3"        "4"        "5"        "6"        "8"       
## mpg.Min.    "18.00000" "18.00000" "20.30000" "15.00000" " 9.00000"
## mpg.1st Qu. "18.75000" "25.00000" "22.85000" "18.00000" "13.00000"
## mpg.Median  "20.25000" "28.40000" "25.40000" "19.00000" "14.00000"
## mpg.Mean    "20.55000" "29.28392" "27.36667" "19.97349" "14.96311"
## mpg.3rd Qu. "22.05000" "32.95000" "30.90000" "21.00000" "16.00000"
## mpg.Max.    "23.70000" "46.60000" "36.40000" "38.00000" "26.60000"
t(aggregate(mpg~cylinders,data = Auto,mean))
##           [,1]       [,2]       [,3]       [,4]       [,5]      
## cylinders "3"        "4"        "5"        "6"        "8"       
## mpg       "20.55000" "29.28392" "27.36667" "19.97349" "14.96311"
t(aggregate(mpg~cylinders,data = Auto,sd))
##           [,1]       [,2]       [,3]       [,4]       [,5]      
## cylinders "3"        "4"        "5"        "6"        "8"       
## mpg       "2.564501" "5.670546" "8.228204" "3.828809" "2.836284"
par(mfrow=c(1,3))
plot(horsepower,mpg,xlab = "horsepower",ylab = "mpg")
new <- data.frame(horsepower=seq(30,300,.1))
lines(seq(30,300,.1),predict(lm(mpg~horsepower),newdata = new),col="red",lwd=4)
plot(as.factor(cylinders),mpg,xlab="cylinders",ylab="mpg",main="Auto Data Set",col=c(7,32,57,82,107))
plot(weight,mpg)
new2 <- data.frame(weight=seq(1600,5200,1))
lines(seq(1600,5200,1),predict(lm(mpg~weight),newdata = new2),col="red",lwd=4)

pairs(Auto[,-c(2,9)])

pairs(Auto[,-c(2,9)],col=cylinders)

library(car)
## Loading required package: carData
Auto <- Auto[,-9]
full.model <- lm(mpg~.,data=Auto)
vif(full.model)[,3]^2  # squared GVIF^(1/(2*Df)), comparable to an ordinary VIF
##    cylinders displacement   horsepower       weight acceleration         year 
##     1.984867    23.271418    10.559221    11.723047     2.684794     1.323483 
##       origin 
##     1.531211

# Question 5

Multicollinearity appears to be a problem in this dataset, specifically with the displacement, horsepower, and weight variables. Based on its very high VIF, displacement could be removed. Thinking about the data, a larger engine probably weighs more and has more displacement and horsepower, so it makes sense that these variables are related and raise multicollinearity issues.
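As a quick corroboration of the VIF results, one could also inspect pairwise correlations among the suspect predictors directly. Sketched here on the built-in `mtcars` data as a stand-in, where the analogous engine variables show the same pattern:

```r
# Stand-in example on base R's mtcars: displacement (disp), horsepower (hp),
# and weight (wt) are strongly correlated with one another, the same
# pattern the VIFs flag for the Auto data.
round(cor(mtcars[, c("disp", "hp", "wt")]), 2)
```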

par(mfrow=c(1,3))
plot(horsepower,mpg,xlab = "horsepower",ylab="mpg")
new <- data.frame(horsepower=seq(30,300,.1))
horse.model <- lm(mpg~horsepower)
lines(seq(30,300,.1),predict(horse.model,newdata = new),col="red",lwd=4)
plot(horse.model$fitted.values,horse.model$residuals,xlab = "Fitted Values",ylab = "Residuals")
plot(horsepower,horse.model$residuals,xlab = "Horsepower",ylab = "Residuals")

par(mfrow=c(1,3))
plot(horsepower,mpg,xlab = "horsepower",ylab = "mpg")
new <- data.frame(horsepower=seq(30,300,.1))
horse.model2 <- lm(mpg~horsepower +I(horsepower^2))
plot(horse.model2$fitted.values,horse.model2$residuals,xlab = "Fitted Values",ylab="Residuals")
plot(horsepower, horse.model2$residuals,xlab = "Horsepower",ylab = "Residuals")

par(mfrow=c(1,3))
plot(horsepower,mpg,xlab="horsepower",ylab = "mpg",col=cylinders)
new <- data.frame(horsepower=seq(30,300,.1))
horse.model2 <- lm(mpg~horsepower+I(horsepower^2))
lines(seq(30,300,.1),predict(horse.model2,newdata = new),col="red",lwd=4)
plot(horse.model2$fitted.values,horse.model2$residuals,xlab = "Fitted Values",ylab = "Residuals")
plot(horsepower,horse.model2$residuals,xlab="Horsepower",ylab = "Residuals")

par(mfrow=c(1,4))
plot(horse.model2)

# Question 6

#par(mfrow=c(1,3))
plot(x=horsepower,y=mpg,xlab = "horsepower",ylab = "mpg",col=cylinders)
new <- data.frame(horsepower=seq(30,300,.1))
horse.model3 <- lm(log(mpg)~horsepower+I(horsepower^2))

lines(seq(30,300,.1),exp(predict(horse.model3,newdata = new)),col="red",lwd=4)  # back-transform predictions from the log scale

plot(horse.model3$fitted.values,horse.model3$residuals,xlab = "Fitted Values",ylab = "Residuals")

plot(horsepower,horse.model3$residuals,xlab="Horsepower",ylab="Residuals")

par(mfrow=c(1,4))
plot(horse.model3)

summary(horse.model3)
## 
## Call:
## lm(formula = log(mpg) ~ horsepower + I(horsepower^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66460 -0.12041  0.00316  0.11349  0.66376 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.402e+00  7.260e-02  60.639  < 2e-16 ***
## horsepower      -1.711e-02  1.255e-03 -13.632  < 2e-16 ***
## I(horsepower^2)  3.901e-05  4.922e-06   7.925 2.44e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1764 on 389 degrees of freedom
## Multiple R-squared:  0.7324, Adjusted R-squared:  0.731 
## F-statistic: 532.2 on 2 and 389 DF,  p-value: < 2.2e-16

There is no clear pattern to the residuals; they appear evenly distributed throughout the plot and show no obvious shape, indicating no clear departure from normality. The transformation does not fully correct the constant-variance issue, but I think including the quadratic term is still beneficial to this model. Based on the summary statistics, the model is significant as a whole and so is each of the predictors.

Question 7

Question 7, part 2

We must keep in mind that we deleted the observations with three- or five-cylinder engines purely for our own convenience. Selectively removing data limits the model: it can no longer capture any trends involving those engine types, and its conclusions should not be extended to them.

library(leaps)
# Part 3
Auto <- Auto[-120,]
set.seed(1234)
index<-sample(1:dim(Auto)[1],192,replace=F)
train<-Auto[index,]
test<-Auto[-index,]
reg.fwd=regsubsets(log(mpg)~displacement+weight+acceleration+year+origin,data=train,method="forward",nvmax=7)

predict.regsubsets <- function(object, newdata, id, ...){
  form  <- as.formula(object$call[[2]])
  mat   <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}

testASE<-c()
# note the loop runs to 6, matching the number of model terms available from regsubsets

for (i in 1:6){
  predictions<-predict.regsubsets(object=reg.fwd,newdata=test,id=i) 
  testASE[i]<-mean((log(test$mpg)-predictions)^2)
}

par(mfrow=c(1,1))
plot(1:6,testASE,type="l",xlab="# of predictors",ylab="test vs train ASE",ylim=c(0.01,0.05))
index<-which(testASE==min(testASE))
points(index,testASE[index],col="red",pch=10)
rss<-summary(reg.fwd)$rss
lines(1:6,rss/nrow(train),lty=3,col="blue")  # train ASE = RSS / training sample size (192)

reg.final=regsubsets(log(mpg)~displacement+weight+acceleration+year+origin,data=Auto,method="forward",nvmax=6)
coef(reg.final,3)
##   (Intercept)        weight          year       origin2 
##  1.5633405896 -0.0003014851  0.0319115667  0.0483012935
final.model<-lm(log(mpg)~weight+year+origin,data=Auto)
summary(final.model)
## 
## Call:
## lm(formula = log(mpg) ~ weight + year + origin, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42350 -0.07213  0.01214  0.07139  0.38223 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.508e+00  1.448e-01  10.413  < 2e-16 ***
## weight      -2.861e-04  9.369e-06 -30.542  < 2e-16 ***
## year         3.183e-02  1.754e-03  18.145  < 2e-16 ***
## origin2      7.259e-02  1.877e-02   3.868 0.000129 ***
## origin3      5.790e-02  1.870e-02   3.096 0.002101 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1203 on 386 degrees of freedom
## Multiple R-squared:  0.8765, Adjusted R-squared:  0.8752 
## F-statistic: 684.7 on 4 and 386 DF,  p-value: < 2.2e-16
plot(exp(final.model$fitted.values),Auto$mpg,xlab="Predicted mpg",ylab="Actual mpg")
lines(c(0,50),c(0,50),col="red")

head(predict(final.model,Auto,interval="predict"))
##        fit      lwr      upr
## 1 2.733423 2.495769 2.971076
## 2 2.679342 2.441705 2.916978
## 3 2.752881 2.515208 2.990553
## 4 2.753739 2.516066 2.991412
## 5 2.749161 2.511493 2.986829
## 6 2.493921 2.255956 2.731886

Based on the model selection above, weight, year, and origin should be used as predictors.

# Part 4

summary(final.model)
## 
## Call:
## lm(formula = log(mpg) ~ weight + year + origin, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42350 -0.07213  0.01214  0.07139  0.38223 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.508e+00  1.448e-01  10.413  < 2e-16 ***
## weight      -2.861e-04  9.369e-06 -30.542  < 2e-16 ***
## year         3.183e-02  1.754e-03  18.145  < 2e-16 ***
## origin2      7.259e-02  1.877e-02   3.868 0.000129 ***
## origin3      5.790e-02  1.870e-02   3.096 0.002101 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1203 on 386 degrees of freedom
## Multiple R-squared:  0.8765, Adjusted R-squared:  0.8752 
## F-statistic: 684.7 on 4 and 386 DF,  p-value: < 2.2e-16
plot(final.model)

Based on the diagnostic plots above, I think this model meets the regression assumptions: the residuals show no clear pattern, have roughly constant variance, and are approximately normal, so the model is suitable for analysis.